class: title-slide, left, bottom <img src="data:image/png;base64,#img/session06/dplyr_wrangling.PNG" width="40%"/> # Data wrangling with dplyr ---- ## **Session 6** ### ### .right-column[ ] .footnote[Artwork by @allison_horst] ??? This session is about data wrangling with dplyr. I think this is the most important session - so do try and stay with it if you can. Dplyr is probably the one package I'll use for every R Project, and if you can be good at dplyr then you'll save yourself loads of time in the future. --- class: center, middle # Wrangling </br> Reshaping or transforming data </br> into a format which is easier to work with </br> (…for later visualisation, modelling, </br> or computing of statistics…) ??? So dplyr is a package to help with data wrangling. When we say wrangling, we essentially mean the process of manipulating data into the format that we want it in. So after you load and maybe clean the data, you'll want to manipulate it into a certain format to help you with subsequent analysis or plotting. --- # A note on "tidy" data Tidyverse functions work best with tidy data: 1. Each variable forms a column. 2. Each observation forms a row. </br> (Broadly, this means long rather than wide tables) ??? Now the tidyverse suite of packages, as implied by the name, encourage data to be 'tidy' for subsequent working. This means that data should be in the form where each variable forms a column, and each observation forms a row. If you've worked with SQL, access or other databases, that shouldn't be new to you, but it's worth keeping in mind as the goal you should aim for. --- class: center, middle # The tool: dplyr package .blue[(dee-ply-r)] is a language for data manipulation </br> </br> Most wrangling puzzles can be solved with </br> knowledge of just 5 dplyr verbs .blue[(5 functions)]. </br> </br> These verbs will be the subject of this session. ??? The way dplyr works is that it's structured around a series of functions or 'verbs' which perform common wrangling tasks, such as reordering, creating new variables, filtering and so on. This, together with the way the code is written, means that complex data manipulation can be concisely and easily expressible by chains of these verbs, as I'll come on to. --- # Project 2: ## Exploring Mental Health (MH) Inpatient Capacity -- </br> The following is some analysis of Mental Health inpatient capacity in England. </br> As part of this, we will be looking at the changes in the number (and occupancy) of MH beds available. -- ### Background Maintaining clinical effectiveness and safety when a ward is fully occupied is a serious challenge for staff. </br> Inappropriate out of area placements have an added cost and also mean patients are separated from their social support networks. ??? The worked example for this topic is to manipulate some data on mental health beds. As everyone is no doubt aware, the number and availability of such beds is a matter of concern, so we'll see what we can glean from some public data --- # The Data: </br> KH03 returns (bed numbers and occupancy) by organisation, published by NHS England. </br> Scraped from the NHSE statistics website: </br> https://www.england.nhs.uk/statistics/statistical-work-areas/bed-availability-and-occupancy/bed-data-overnight/ .blue[partially cleaned] ??? This data is taken from NHSI/E's statistical work areas website - it's public data, aggregated and non-identifiable. NHSI/E do some cleaning of it, but not everything that we'd want. --- # Start a new script File/New script or shortcut keys <kbd>Ctrl + Shift + N </kbd> .left-column[ Load the data beds_data.csv ] .right-column[ <img class="center" src="data:image/png;base64,#img/session06/new-script.PNG"/> ] ??? Ok, so to follow along, create a new R Script, save the beds_data.csv file that was previously distributed in your working directory, then load it in using the import wizard. <pause and ask if anyone having difficulty> --- # Less friendly csvs <img class="center" src="data:image/png;base64,#img/session06/beds-data-wizard.PNG"/> ??? Now, what you'll see initially using the import wizard is that there's a header in the file which is causing an issue, pushing our column names down --- # Less friendly csvs <img class="center" src="data:image/png;base64,#img/session06/beds-data-wizard-2.png"/> ??? We need to get rid of that metadata --- # Less friendly csvs Note you will have to move the cursor to another area for the number to have an effect. <img class="center" src="data:image/png;base64,#img/session06/beds-data-wizard-skip.PNG"/> ??? To fix this, you add the number of rows to skip in this box here, which is 3, then click elsewhere which should hopefully fix things --- # Less friendly csvs <img class="center" src="data:image/png;base64,#img/session06/beds-data-wizard-date.png"/> ??? As well as this, we have an issue that RStudio thinks that the date field is a character column, when we want to use it as a date. --- # Less friendly csvs <img class="center" src="data:image/png;base64,#img/session06/beds-data-date-wizard.PNG"/> ??? So to do this, change the drop down on the column heading to date. It'll pop up a window asking about what format the date is in. The default is US, so we want to tell it that the data follows UK date conventions (dmy). So change the m to a d and the d to an m and click ok in the pop up. Then follow the method we did before, pasting the code into the R Script and running that, to import the data. I'll demo this now to show you --- # Look at the data This is real data so there are real issues (which we'll work with) <table> <thead> <tr> <th style="text-align:left;"> date </th> <th style="text-align:left;"> org_code </th> <th style="text-align:left;"> org_name </th> <th style="text-align:right;"> beds_av </th> <th style="text-align:right;"> occ_av </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 2013-09-01 </td> <td style="text-align:left;"> R1A </td> <td style="text-align:left;"> Worcestershire Health And Care </td> <td style="text-align:right;"> 129 </td> <td style="text-align:right;"> 117 </td> </tr> <tr> <td style="text-align:left;"> 2013-09-01 </td> <td style="text-align:left;"> R1C </td> <td style="text-align:left;"> Solent </td> <td style="text-align:right;"> 105 </td> <td style="text-align:right;"> 82 </td> </tr> <tr> <td style="text-align:left;"> 2013-09-01 </td> <td style="text-align:left;"> R1E </td> <td style="text-align:left;"> Staffordshire And Stoke On Trent Partnership </td> <td style="text-align:right;"> NA </td> <td style="text-align:right;"> NA </td> </tr> <tr> <td style="text-align:left;"> 2013-09-01 </td> <td style="text-align:left;"> R1F </td> <td style="text-align:left;"> Isle Of Wight </td> <td style="text-align:right;"> 54 </td> <td style="text-align:right;"> 42 </td> </tr> <tr> <td style="text-align:left;"> 2013-09-01 </td> <td style="text-align:left;"> R1H </td> <td style="text-align:left;"> Barts Health </td> <td style="text-align:right;"> NA </td> <td style="text-align:right;"> NA </td> </tr> <tr> <td style="text-align:left;"> 2013-09-01 </td> <td style="text-align:left;"> R1J </td> <td style="text-align:left;"> Gloucestershire Care Services </td> <td style="text-align:right;"> NA </td> <td style="text-align:right;"> NA </td> </tr> </tbody> </table> ??? Ok, so looking at the data, there are some issues - notably the fact that there are lots of "NA" values. These are trusts that don't have any mental health beds, usually because they're not mental health providers. Having NAs in the data makes it unwieldy, so that's something we'll need to fix in due course. --- # Dplyr 5 key verbs will help us gain a deeper understanding of our data sets. Note summarise() can also be spelt summarize() ```r dplyr::arrange() dplyr::filter() dplyr::mutate() dplyr::group_by() dplyr::summarise() ``` ??? So how do we fix this in dplyr? So in dplyr there are 5 main functions or verbs, arrange, filter, mutate, group_by, and summarise. Note that summarise can be spelt either way. The main author of the tidyverse is a New Zealander, so British English spellings work too. --- # Building with steps These verbs aren't used independently of each other. Each can be a step in the code, like a recipe for mashed potato: .blue[Starts with an...] </br> potato then </br> peel then </br> slice into medium sized pieces then </br> boil for 25 minutes then </br> mash ??? Now, to illustrate how these work, I'm going to draw the analogy with a cooking recipe. In a cooking recipe, for example for mashed potato, you're following a series of steps and carrying out those steps sequentially, to get the outcome you want. --- # Building with steps These verbs aren't used independently of each other. Each can be a step in the code, like a recipe for mashed potato: .blue[Start with an R object] </br> **potato** then </br> peel then </br> slice into medium sized pieces then </br> boil for 25 minutes then </br> mash ??? So, you start with the raw data, or an R object, in this analogy, the raw potato.... --- # Building with steps These verbs aren't used independently of each other. Each can be a step in the code, like a recipe for mashed potato: .blue[Start with an R object] </br> **potato** then </br> **peel()** then </br> slice into medium sized pieces then </br> boil for 25 minutes then </br> mash ??? Then you perform the actions, so first you need to peel it. In the case of writing code, each of these action steps is a function or dplyr verb. So peel might be a peel() function..... --- # Building with steps These verbs aren't used independently of each other. Each can be a step in the code, like a recipe for mashed potato: .blue[Start with an R object] </br> **potato** then </br> **peel()** then </br> **slice(size = "medium")** then </br> boil for 25 minutes then </br> mash ??? Slice might take an argument specifying exactly how thin you want the slices..... --- # Building with steps These verbs aren't used independently of each other. Each can be a step in the code, like a recipe for mashed potato: .blue[Start with an R object] </br> **potato** then </br> **peel()** then </br> **slice(size = "medium")** then </br> **boil(time = 25)** then </br> .blue[output = mashed potato] ??? The boil action might take an argument for how long for..... --- # Building with steps .grey[These verbs aren't used independently of each other. Each can be a step in the code, like a recipe for mashed potato:] .blue[input object] </br> **potato** %>% </br> **peel()** %>% </br> **slice(size = "medium")** %>% </br> **boil(time = 25)** .blue[output = mashed potato] ??? And run start to finish, you get your mashed potato. Note that dplyr verbs are normally strung together with this %>% symbol, between each step. --- # Tidyverse syntax .blue[input object] data_frame .blue[then] </br>.green[do_this(]rules.green[)] .blue[then] </br>.green[do_this(]rules.green[)] .blue[output = new data frame] ??? So to generalise, what the syntax will typically look like, is a dataframe at the top (or another R object) then a series of functions amending it to get to the output you want. --- # Tidyverse syntax </br> data_frame .blue[%>%] </br>.green[do_this(]rules.green[)] .blue[%>%] </br>.green[do_this(]rules.green[)] ??? The steps are chained together over several lines with this %>% symbol. --- # Tidyverse syntax </br> Combine simple pieces to solve complex puzzles data_frame .blue[%>%] </br>.green[do_this(]rules.green[)] .blue[%>%] </br>.green[do_this(]rules.green[)] ??? In this way, you can have quite complex operations made up of little steps, simple enough on their own, but powerful when combined. --- class: inverse, middle, center #Using dplyr ??? Ok, lets actually try some of this out --- class: inverse, center, middle # Q1. Which organisation provided the highest number of Mental Health (MH) beds? ??? One question we might be interested in is which org had the most mental health beds? --- .blue[# 1. arrange] Reorder rows based on select variable </br> <img class="center" src="data:image/png;base64,#img/session06/arrange-verb-code.PNG" width="40%"/> ```r beds_data %>% arrange(beds_av) ``` -- </br> Shortcut for %>% = <kbd>Ctrl + Shift + m</kbd> ??? To do that, we're going to feed the beds data frame into the arrange function, which re-orders the dataframe. In the brackets, we tell it which variable we want to order it by A nice shortcut for %>% is control shift and m. Can't remember if this works in cloud or not, so if you are using the cloud and it doesn't you might have to type it out, percent greater-than, percent. --- .blue[# 1. arrange] Reorder rows based on select variable... defaults to ascending order ```r beds_data %>% arrange(beds_av) ``` ``` ## # A tibble: 4,558 × 5 ## date org_code org_name beds_av occ_av ## <date> <chr> <chr> <dbl> <dbl> ## 1 2018-09-01 RAL Royal Free London 2 2 ## 2 2018-06-01 RAL Royal Free London 4 4 ## 3 2015-09-01 RAL Royal Free London 5 2 ## 4 2013-09-01 RAL Royal Free London 6 1 ## 5 2013-12-01 RAL Royal Free London 6 3 ## 6 2014-03-01 RAL Royal Free London 6 2 ## 7 2014-06-01 RAL Royal Free London 6 3 ## 8 2014-09-01 RAL Royal Free London 6 2 ## 9 2014-12-01 RAL Royal Free London 6 2 ## 10 2015-03-01 RAL Royal Free London 6 3 ## # … with 4,548 more rows ``` ??? The default behaviour is to sort ascending, so if you do that step in the console, you'll get something like this --- .blue[# 1. arrange] We need descending order: ```r beds_data %>% * arrange(desc(beds_av)) ``` ``` ## # A tibble: 4,558 × 5 ## date org_code org_name beds_av occ_av ## <date> <chr> <chr> <dbl> <dbl> ## 1 2013-09-01 RHA Nottinghamshire Healthcare 1050 929 ## 2 2013-12-01 RHA Nottinghamshire Healthcare 1002 888 ## 3 2014-03-01 RHA Nottinghamshire Healthcare 993 875 ## 4 2014-06-01 RHA Nottinghamshire Healthcare 991 857 ## 5 2015-03-01 RHA Nottinghamshire Healthcare 977 821 ## 6 2014-09-01 RHA Nottinghamshire Healthcare 971 844 ## 7 2014-12-01 RHA Nottinghamshire Healthcare 939 829 ## 8 2015-12-01 RWK East London 925 757 ## 9 2015-09-01 RWK East London 919 762 ## 10 2016-09-01 RWK East London 916 770 ## # … with 4,548 more rows ``` </br> desc() works for text and numeric variables ??? To sort descending, you need to wrap the sort variable inside a desc() function And if you do that, you see that Nottinghamshire Healthcare had the most mental health beds. But of course this data is monthly, so really we want a more specific question --- class: inverse, center, middle # Q2. Which 2 organisations provided the highest number of MH beds in September 2018? ??? So lets pick a specific month to look at, and exclude observations from other month-year pairs. --- class: center, middle # .darkgrey[Q2. Which 2 organisations provided the .blue[highest number] of MH beds in September 2018?] .pull-left[ We'll use arrange() as before] ??? We'll need arrange(), but something else as well --- class: center, middle # .darkgrey[Q2. Which 2 organisations provided the .blue[highest number] of MH beds .blue[in September 2018?]] .pull-left[ .darkgrey[We'll use arrange() as before] </br> </br> But we require only observations with this date ] ??? We need something which will return only values from September 2018 --- .blue[# 2. filter] pick observations by their value <img class="center" src="data:image/png;base64,#img/session06/filter-verb-code.PNG" width="40%"/> -- ## == is a test of equality ```r beds_data %>% * filter(date == "2018-09-01") ``` The expression inside brackets should return TRUE or FALSE. We are choosing rows where this expression is TRUE. To exclude and test where the expression is NOT equal `!=` ```r beds_data %>% * filter(date != "2018-09-01") ``` ??? The verb we want here is filter() And to use it, we pass in the condition we want. For a test of equality, we need to use two equals signs, and not equal to is an exclamation mark and an equals sign --- .blue[# 2. filter] pick observations by their value ```r beds_data %>% arrange(desc(beds_av)) %>% * filter(date == "2018-09-01") ``` ??? So to answer our question, we'll chain a filter verb onto the arrange function, and see what result we get <ask if anyone has the answer, if not demo myself> --- class: inverse, center, middle # Q3. Which 5 organisations had the highest percentage bed occupancy in September 2018? ??? Now we'll consider a slightly more focussed question, looking at the percentage occupancy --- class: center, middle # .darkgrey[Q3. Which 5 organisations had the .blue[highest] percentage bed occupancy in September 2018?] We'll use arrange() as before ??? We'll keep arrange from before --- class: center, middle # .darkgrey[Q3. Which 5 organisations had the .blue[highest] percentage bed occupancy in .blue[September 2018?]] .darkgrey[We'll use arrange() as before] </br> We'll use filter() as before ??? and also filter --- class: center, middle # .darkgrey[Q3. Which 5 organisations had the highest .blue[percentage bed occupancy] in September 2018?] .darkgrey[We'll use arrange() as before] </br> .darkgrey[We'll use filter() as before] </br> We don't have this variable... -- but we can create it ??? Now we don't have a percentage occupancy variable We need to create it... --- .blue[# 3. mutate] create .green[new variables] from .green[existing ones] ```r beds_data %>% * mutate(perc_occ = occ_av / beds_av) ``` </br> perc_occ is a new named column from existing occ_av and beds_av columns -- ## = is NOT a test of equality ??? The verb to create new variables is mutate. It's the least intuitively named of the verbs, but one that you'll use often enough that you'll remember it. We name the new variable, then say how it's defined. In this case it's the occupied number of beds divided by the total number of beds, to give an occupancy percentage. --- .blue[# 3. mutate] create .green[new variables] from .green[existing ones] ```r beds_data %>% * mutate(perc_occ = occ_av / beds_av) %>% filter(date == "2018-09-01") %>% * arrange(desc(perc_occ)) ``` We can refer to variables we've just created above ??? So we can slot that all together. So here we first create the variable (though we could have done it after the filter), then filter, then arrange according to the new variable descending <ask for answer, and if anyone struggled - demo if so> --- class: inverse, center, middle # Q4. What was the mean number of beds, (across all trusts) for each value of date? ??? Ok, lets consider a more complex question about what the mean average bed numbers were at each month --- class: center, middle # .darkgrey[Q4. What was the .blue[mean number of beds], (across all trusts) for each value of date?] Let's first look at how we'd produce summary statistics like a mean... ??? So we probably need to create a mean at some point, but we also need to add together values - so we can't just use mutate easily --- .blue[# 4. summarise] collapse many values into a single summary value ```r beds_data %>% * summarise(means_beds = mean(beds_av)) ``` </br> Similar in syntax to mutate: means_beds is the new column name </br> mean() function uses the existing column beds_av -- The result is NA ``` ## # A tibble: 1 × 1 ## means_beds ## <dbl> ## 1 NA ``` ??? What we want to use is the summarise verb, which collapses the data frame and generates summary statistics for all observations. Here we are telling dplyr we want to summarise the dataframe with a summary variable means_beds, which is defined as the mean of available beds. Unfortunately, the result is NA. This is because we have missing variables, coded as NA in the dataframe --- .blue[# 4. summarise] collapse many values into a single summary value ```r beds_data %>% summarise(means_beds = mean(beds_av, * na.rm = TRUE)) ``` </br> We'll need to remove NA values to get a suitable mean. TRUE can also be T ??? What we can do in this circumstance is to tell summarise to ignore the NA values, which we do by setting the na.rm argument to TRUE (all caps). You can use a capital T instead, but I'd advise against it. Your code is slightly more secure by writing it in full, and it's more readable as well --- .blue[# 4. summarise] collapse many values into a single summary value ```r beds_data %>% summarise(means_beds = mean(beds_av, * na.rm = TRUE)) ``` ``` ## # A tibble: 1 × 1 ## means_beds ## <dbl> ## 1 300. ``` This code produces a single summary value for the whole dataset ??? And if you run that, you see that the result is an average of 300 beds for the whole dataset. --- class: inverse, middle, center # Q4. What was the mean number of beds (across all trusts) for each value of date? ??? Ok, this time we want to know the mean number of beds (across all organisations) in each month? --- class: middle, center #.darkgrey[Q4. What was the .blue[mean number of beds] (across all trusts) for each value of date?] .blue[Now we know how to use summarise...] ??? We can use summarise, but this time.... --- class: middle, center #.darkgrey[Q4. What was the mean number of beds (across all trusts) .blue[for each value of date?]] .darkgrey[Now we know how to use summarise...] </br> .blue[We'll produce a summary value for **each value of date**] ??? We need to ensure we get a summary for each date, not just overall. And for that, we need a verb called group_by. --- .blue[# 5. group_by] For each group... ```r beds_data %>% * group_by(date) %>% summarise(mean_beds = mean(beds_av, na.rm = TRUE)) ``` For each value of date... </br> something must follow a group_by() as it does nothing to the output alone. The change occurs behind the scenes. ??? What we do, is before the summarise verb, we add in a group_by (group underscore by), passing the variable we want to compute summary values for. Then summarise knows that we want to compute summarise by group --- .blue[# 5. group_by] For each group... .darkgrey[summarise() produces a single summary value] ```r beds_data %>% * group_by(date) %>% summarise(mean_beds = mean(beds_av, na.rm = TRUE)) ``` ``` ## # A tibble: 21 × 2 ## date mean_beds ## <date> <dbl> ## 1 2013-09-01 324. ## 2 2013-12-01 322. ## 3 2014-03-01 319. ## 4 2014-06-01 320. ## 5 2014-09-01 318. ## 6 2014-12-01 315. ## 7 2015-03-01 314. ## 8 2015-06-01 294. ## 9 2015-09-01 296. ## 10 2015-12-01 296. ## # … with 11 more rows ``` ??? And here's the result - computed mean values for each date, giving us a time series of mean beds --- .blue[# 5a. ungroup] Often it's safest to remove the grouping after you've performed the required operation <img class="center" src="data:image/png;base64,#img/session06/group_by_ungroup.png" width="40%"/> ```r beds_data %>% group_by(date) %>% summarise(mean_beds = mean(beds_av, na.rm = TRUE)) %>% ungroup() ``` ??? A technical note - often it's best to 'ungroup' using the ungroup function after you've done what you need to. That's because it's easy to forget you've grouped it, and you might get strange values in subsequent operations --- class: inverse, middle, center # Q5. Which 5 organisations have the highest mean % bed occupancy? </br> -- # (over the 5 year period) ??? OK, for the last question, you're going to have a go from scratch - We're going to look at which orgs have the highest mean occupancy percentage, over the 5 year period --- class: middle, center # .darkgrey[Q5. Which 5 organisations have the highest mean .blue[% bed occupancy]?] .blue[Create a new variable mutate()] ??? To help you out, I can outline the steps. First you'll need to use mutate to create a new variable --- class: middle, center # .darkgrey[Q5. Which 5 .blue[organisations] have the highest mean % bed occupancy?] .darkgrey[Create a new variable mutate()] </br> </br> .blue[Then, for each of the organisations group_by()] ??? Then use a group_by --- class: middle, center # .darkgrey[Q5. Which 5 organisations have the highest .blue[mean] % bed occupancy?] .darkgrey[Create a new variable mutate() </br> </br> Then, for each of the organisations group_by() </br> </br> .blue[Summary statistic using summarise()]] ??? Then summarise --- class: middle, center # .darkgrey[Q5. Which 5 organisations have the .blue[highest] mean % bed occupancy?] .darkgrey[Create a new variable mutate() </br> </br> Then, for each of the organisations group_by() </br> </br> Summary statistic using summarise() </br> </br> .blue[Order to find highest by using arrange()]] ??? And finally order with arrange --- class: middle, center # .darkgrey[Q5. Which 5 organisations have the .blue[highest] mean % bed occupancy?] .darkgrey[Create a new variable mutate() </br> </br> Then, for each of the organisations group_by() </br> </br> Summary statistic using summarise() </br> </br> Order to find highest by using arrange()] </br> .pull-left[ .blue[Tip: Run the code after each new line to check it returns the output you'd expect.] ] ??? Try running the code after each stage, so you know it works at each step --- class: inverse, middle, center # Over to you... ??? Ok, I'll give you a few minutes, then I'll show you the answer --- class: middle, center # Q5. Which 5 organisations have the highest mean % bed occupancy? Create a new variable mutate() </br> </br> Then, for each of the organisations group_by() </br> </br> Summary statistic using summarise() </br> </br> Order to find highest by using arrange() </br> .pull-left[ .blue[Tip: Run the code after each new line to check it returns the output you'd expect.] ] .pull-right[ Hint: ```r beds_data %>% mutate() %>% group_by() %>% summarise() %>% arrange() ``` ] ??? Wait here for a couple of mins --- # Solution ```r beds_data %>% mutate(perc_occ = occ_av/beds_av) %>% group_by(org_name) %>% summarise(mean_pocc = mean(perc_occ, na.rm = TRUE)) %>% arrange(desc(mean_pocc)) ``` ``` ## # A tibble: 255 × 2 ## org_name mean_pocc ## <chr> <dbl> ## 1 Barnet, Enfield And Haringey Mental Health 0.982 ## 2 Bradford District Care Trust 0.972 ## 3 Camden And Islington 0.953 ## 4 Hertfordshire Partnership University 0.953 ## 5 North Essex Partnership University 0.945 ## 6 Devon Partnership 0.945 ## 7 Sussex Partnership 0.942 ## 8 Essex Partnership University 0.942 ## 9 Manchester Mental Health And Social Care Trust 0.940 ## 10 Birmingham And Solihull Mental Health 0.935 ## # … with 245 more rows ``` ??? <check everyone had time to had a go Ok, here's the solution. You should have got something like the below, with the highest occupancy rate at Barnet MH Trust --- # Extension How many distinct organisation names are counted under the same organisation code? ```r beds_data %>% mutate(perc_occ = occ_av/beds_av) %>% group_by(org_name) %>% summarise(mean_pocc = mean(perc_occ, na.rm = TRUE), # Number of data points/rows by group * number = n()) %>% arrange(desc(mean_pocc)) ``` ``` ## # A tibble: 255 × 3 ## org_name mean_pocc number ## <chr> <dbl> <int> ## 1 Barnet, Enfield And Haringey Mental Health 0.982 21 ## 2 Bradford District Care Trust 0.972 3 ## 3 Camden And Islington 0.953 21 ## 4 Hertfordshire Partnership University 0.953 9 ## 5 North Essex Partnership University 0.945 15 ## 6 Devon Partnership 0.945 21 ## 7 Sussex Partnership 0.942 21 ## 8 Essex Partnership University 0.942 6 ## 9 Manchester Mental Health And Social Care Trust 0.940 14 ## 10 Birmingham And Solihull Mental Health 0.935 21 ## # … with 245 more rows ``` ??? If you want, you can tweak the code to add in another column, showing the number of observations used to compute the average. To do this, you can add another argument to the summarise verb, computing number, defined using the n() function, which returns the number of observations. --- # Extension How many columns associated with each observation? Adds another summary column. ```r beds_data %>% group_by(org_code) %>% summarise(# distinct number of org_name * distinct_number = n_distinct(org_name)) %>% filter(distinct_number > 1) %>% arrange(desc(distinct_number)) ``` ``` ## # A tibble: 22 × 2 ## org_code distinct_number ## <chr> <int> ## 1 RDE 3 ## 2 RTG 3 ## 3 R1K 2 ## 4 RA9 2 ## 5 RBN 2 ## 6 RD1 2 ## 7 RD8 2 ## 8 RDU 2 ## 9 RGM 2 ## 10 RGN 2 ## # … with 12 more rows ``` ??? You can also use the n_distinct function to return the number of distinct observations, if that's useful. Here we can calculate how many distinctive names are used by the same org_code. These are likely renames / mergers or transfers of hospital control, so not that interesting, but you can see the idea anyway --- .blue[# 6. select] select a subset of variables .darkgrey[from a existing data set] ```r beds_data %>% select(org_code, org_name) ``` ??? Lastly I want to just touch on the select verb, which is used to pick or drop columns or variables. So Select org code, org name here picks those columns, and those columns only, from the dataset. --- .blue[# 6. select] select a subset of variables .darkgrey[from a existing data set] ```r beds_data %>% select(-org_code) ``` .green[To remove a column] ??? to remove a column, you can prefix with a minus sign. This will return everything apart from the org code column. --- .blue[# 6. select] select a subset of variables .darkgrey[from a existing data set] ```r beds_data %>% select(1:3) ``` .green[You can also refer to columns by number, here 1:3 saves having to type 1, 2, 3] ??? You can also do it by number, so this would return the first three columns only --- .blue[# 6. select] select a subset of variables .darkgrey[from a existing data set] ```r beds_data %>% select(org_name, everything()) ``` .green[org_name appears first, followed by everything else - and isn't repeated] ??? You can also use the everything() function inside select to return all columns. That doesn't sound very useful, but what it does to is allow you to reorder. The code here move org_name to be the first column, and then everything after that follows --- # Not dplyr... ... but works within dplyr [Vectors](https://datascienceplus.com/vectors-and-functions-in-r/) are the simplest type of data structure in R. Simply put, a vector is a sequence of data elements of the same basic type. -- You can create a vector with function: c() for concatenate/combine ```r c(100, 80, 200) c("beds", "staff", "patients") c("beds", 80, "patients") # Mixing strings (characters) and numeric values results in all being strings ``` ??? The last thing that's useful with dplyr is character vectors. These are base-r, not the tidyverse, but are useful to create lists of things to pass to dplyr verbs. You create these with the c() function, for concatenate. --- # Very useful in these circumstances ```r c("Bradford District Care", "Bradford District Care Trust") ``` </br> Filter by org_name IN the lookup list. ```r beds_data %>% * filter(org_name %in% c("Bradford District Care", "Bradford District Care Trust")) ``` </br> Filter by org_name NOT IN the lookup list. ```r beds_data %>% * filter(!org_name %in% c("Bradford District Care", "Bradford District Care Trust")) ``` ??? One use for this would be to filter for specific characters in a particular column, so here we're filtering for specific names in the org_name list --- # For SQL users Code would look like this in SQL ```sql SELECT * FROM Table WHERE Colm IN ('Bradford District Care', 'Bradford District Care Trust') ``` And ```sql SELECT * FROM Table WHERE Colm NOT IN ('Bradford District Care', 'Bradford District Care Trust') ``` ??? If you're familiar with SQL - this is very similar to the SQL 'IN' operator Ok, that's the biggest but probably most important topic done, I think. Don't worry if you didn't get it all - it'll probably take a few goes and a bit of practice to get your head around a lot of this. Does anyone have any questions before the final short topic before the last break? If not, the next topic is about naming objects and conventions in R. --- #### This work is licensed as </br> Creative Commons </br> Attribution </br> ShareAlike 4.0 </br> International </br> To view a copy of this license, visit </br> https://creativecommons.org/licenses/by/4.0/